Fast q-gram Mining on SLP Compressed Strings
نویسندگان
چکیده
We present simple and efficient algorithms for calculating qgram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size n that represents string T , we present an O(qn) time and space algorithm that computes the occurrence frequencies of all q-grams in T . Computational experiments show that our algorithm and its variation are practical for small q, actually running faster on various real string data, compared to algorithms that work on the uncompressed text. We also discuss applications in data mining and classification of string data, for which our algorithms can be useful.
منابع مشابه
Speeding Up q-Gram Mining on Grammar-Based Compressed Texts
We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP T of size n that represents string T , the algorithm computes the occurrence frequencies of all q-grams in T , by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size m = |T | − dup(q, T...
متن کاملAlgorithms and data structures for grammar - compressed strings
This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, S is an SLP of size n compressing a string S of size N . We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...
متن کاملData Structures for Grammar-compressed Strings
This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, S is an SLP of size n compressing a string S of size N . We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...
متن کاملComputing q-Gram Non-overlapping Frequencies on SLP Compressed Texts
Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all qgrams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, ...
متن کاملFinding Characteristic Substrings from Compressed Texts
Text mining from large scaled data is of great importance in computer science. In this paper, we consider fundamental problems on text mining from compressed strings, i.e., computing a longest repeating substring, longest non-overlapping repeating substring, most frequent substring, and most frequent non-overlapping substring from a given compressed string. Also, we tackle the following novel p...
متن کامل